Hierarchical Clustering¶
Author: Nico Kuijpers
Date: March 28, 2021
Updated by Jacco Snoeren (July 2023)
Introduction¶
Hierarchical clustering is an unsupervised machine learning algorithm. In this notebook we give an example of how to apply agglomerative clustering to the Iris dataset.
First import the libraries we need.
import numpy as np
import pandas as pd
import sklearn as sk
import seaborn as sns
import matplotlib
import matplotlib.pyplot as plt
print('numpy version:', np.__version__)
print('pandas version:', pd.__version__)
print('scikit-learn version:', sk.__version__)
print('seaborn version:', sns.__version__)
print('matplotlib version:', matplotlib.__version__)
%matplotlib inline
numpy version: 1.26.4
pandas version: 2.2.1
scikit-learn version: 1.4.1.post1
seaborn version: 0.13.2
matplotlib version: 3.8.3
📦 Data provisioning¶
To illustrate hierarchical clustering we use the Iris dataset. The dataset consists of 150 entries, 4 input features, and 1 output label: 50 samples from each of three species of Iris (Iris setosa, Iris virginica, and Iris versicolor). Four features were measured from each sample: the length and the width of the sepals and petals, in centimeters.
During clustering, we ignore the labels (unsupervised learning). We can compare the results of clustering with the labels afterwards.
For more information on the Iris dataset, see https://en.wikipedia.org/wiki/Iris_flower_data_set
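The UCI server is occasionally unreachable. As a fallback, scikit-learn ships a local copy of the same 150-sample dataset; a sketch of loading it with the column names used below (note the bundled species names lack the "Iris-" prefix used in the UCI file):

```python
import pandas as pd
from sklearn.datasets import load_iris

# Load the copy of the Iris dataset bundled with scikit-learn (no download needed)
iris = load_iris(as_frame=True)
df = iris.frame.rename(columns={
    "sepal length (cm)": "Sepal Length",
    "sepal width (cm)": "Sepal Width",
    "petal length (cm)": "Petal Length",
    "petal width (cm)": "Petal Width",
})
# Map the integer target codes back to species names
df["Species"] = iris.target_names[iris.target]
df = df.drop(columns="target")
print(df.shape)  # (150, 5)
```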
# Download the Iris dataset from the internet
columns = ["Sepal Length", "Sepal Width", "Petal Length", "Petal Width", "Species"]
df_iris = pd.read_csv("http://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data", names=columns)
📃 Sample the data¶
Get a first impression of the dataset by printing the data format and showing the first 5 rows and last 5 rows of the DataFrame.
# Explore the Iris dataset
print('Iris dataset shape: {}'.format(df_iris.shape))
df_iris.head(5)
Iris dataset shape: (150, 5)
| | Sepal Length | Sepal Width | Petal Length | Petal Width | Species |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | Iris-setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | Iris-setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | Iris-setosa |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | Iris-setosa |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | Iris-setosa |
df_iris.tail(5)
| | Sepal Length | Sepal Width | Petal Length | Petal Width | Species |
|---|---|---|---|---|---|
| 145 | 6.7 | 3.0 | 5.2 | 2.3 | Iris-virginica |
| 146 | 6.3 | 2.5 | 5.0 | 1.9 | Iris-virginica |
| 147 | 6.5 | 3.0 | 5.2 | 2.0 | Iris-virginica |
| 148 | 6.2 | 3.4 | 5.4 | 2.3 | Iris-virginica |
| 149 | 5.9 | 3.0 | 5.1 | 1.8 | Iris-virginica |
Print the different species in the dataset.
print(df_iris['Species'].unique())
['Iris-setosa' 'Iris-versicolor' 'Iris-virginica']
Print the number of flowers for each species and visualize these numbers using a bar plot.
print(df_iris['Species'].value_counts())
df_iris['Species'].value_counts().plot(kind='bar')
Species
Iris-setosa        50
Iris-versicolor    50
Iris-virginica     50
Name: count, dtype: int64
<Axes: xlabel='Species'>
Preprocessing¶
Method pandas.DataFrame.info() prints information about a DataFrame including the index dtype and columns, non-null values, and memory usage. See https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.info.html.
df_iris.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 5 columns):
 #   Column        Non-Null Count  Dtype
---  ------        --------------  -----
 0   Sepal Length  150 non-null    float64
 1   Sepal Width   150 non-null    float64
 2   Petal Length  150 non-null    float64
 3   Petal Width   150 non-null    float64
 4   Species       150 non-null    object
dtypes: float64(4), object(1)
memory usage: 6.0+ KB
Method pandas.DataFrame.describe() generates descriptive statistics. These include central tendency, dispersion,
and shape of a dataset's distribution, excluding NaN values.
See https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.describe.html#pandas.DataFrame.describe
df_iris.describe()
| | Sepal Length | Sepal Width | Petal Length | Petal Width |
|---|---|---|---|---|
| count | 150.000000 | 150.000000 | 150.000000 | 150.000000 |
| mean | 5.843333 | 3.054000 | 3.758667 | 1.198667 |
| std | 0.828066 | 0.433594 | 1.764420 | 0.763161 |
| min | 4.300000 | 2.000000 | 1.000000 | 0.100000 |
| 25% | 5.100000 | 2.800000 | 1.600000 | 0.300000 |
| 50% | 5.800000 | 3.000000 | 4.350000 | 1.300000 |
| 75% | 6.400000 | 3.300000 | 5.100000 | 1.800000 |
| max | 7.900000 | 4.400000 | 6.900000 | 2.500000 |
Analyse the dataset using a box-and-whisker plot generated by method pandas.DataFrame.boxplot().
See https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.boxplot.html
For more information on box plots, see https://en.wikipedia.org/wiki/Box_plot
From the box plot below, it can be observed that for Sepal Length and Sepal Width, there is some overlap in values for the three different species. Petal Length and Petal Width show less overlap. This information may be useful when selecting features.
Note: by default, the box plot will be partly shown and a scroll bar appears. To view the entire box plot, select Cell → All Output → Toggle Scrolling.
iris_features = list(df_iris.columns[:4])
df_iris.boxplot(column=iris_features, by='Species', figsize=(15,8), layout=(1,4));
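The overlap visible in the box plots can also be quantified. As a sketch (using scikit-learn's bundled copy of the data so the snippet is self-contained), the per-species range of petal length shows that Iris setosa does not overlap the other two species at all:

```python
from sklearn.datasets import load_iris

iris = load_iris(as_frame=True)
df = iris.frame
df["Species"] = iris.target_names[iris.target]

# Per-species minimum and maximum of petal length
ranges = df.groupby("Species")["petal length (cm)"].agg(["min", "max"])
print(ranges)
# setosa tops out at 1.9 cm, below the versicolor minimum of 3.0 cm
```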
Plot pairwise relationships using method seaborn.pairplot.
See https://seaborn.pydata.org/generated/seaborn.pairplot.html
ax = sns.pairplot(df_iris, hue='Species')
plt.show()
💡 Feature selection¶
We cluster using three feature sets: all four features, only the sepal features, and only the petal features.
From the box plots it can be observed that the values range between 0 and 8 cm and that the distribution differs per feature. For instance, Sepal Length ranges between 4 and 8 cm, while Petal Width ranges between 0 and 3 cm.
For distance-based algorithms such as K-means and agglomerative clustering it is therefore important to normalize the data. Using the StandardScaler,
the standard score of a sample $x$ is calculated as $z=(x-u)/s$, where $u$ is the mean and $s$ is
the standard deviation.
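A minimal sketch of what the StandardScaler does to a single column:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[4.0], [5.0], [6.0], [9.0]])  # one feature, four samples
scaler = StandardScaler().fit(X)

# z = (x - u) / s, with u the column mean and s its (population) standard deviation
z = scaler.transform(X)
print(scaler.mean_)  # [6.]
print(z.ravel())

# After scaling, the column has mean 0 and unit variance
```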
from sklearn.preprocessing import StandardScaler
# Define X_iris and y_iris
X_iris = df_iris[['Sepal Length', 'Sepal Width', 'Petal Length', 'Petal Width']]
y_iris = df_iris['Species']
# Create an array of information for each case
case_info = [
    {'features': ['Sepal Length', 'Sepal Width', 'Petal Length', 'Petal Width'], 'normalize': False},
    {'features': ['Sepal Length', 'Sepal Width'], 'normalize': False},
    {'features': ['Petal Length', 'Petal Width'], 'normalize': False}
]
# Initialize an empty list to store the resulting arrays
resulting_arrays = []
# Iterate over each case
for info in case_info:
    selected_features = info['features']
    X_iris_case = df_iris[selected_features]
    # Normalize the data if needed
    if info['normalize']:
        scaler_iris = StandardScaler().fit(X_iris_case)
        X_iris_normalized = scaler_iris.transform(X_iris_case)
    else:
        X_iris_normalized = X_iris_case.to_numpy()
    # Append the resulting array to the list
    resulting_arrays.append(X_iris_normalized)

for i, array in enumerate(resulting_arrays):
    print(f"Array for case {i}:\n{array}\n")
Array for case 0:
[[5.1 3.5 1.4 0.2]
 [4.9 3.  1.4 0.2]
 ...]
(150×4 array, output truncated)

Array for case 1:
[[5.1 3.5]
 [4.9 3. ]
 ...]
(150×2 array, output truncated)

Array for case 2:
[[1.4 0.2]
 [1.4 0.2]
 ...]
(150×2 array, output truncated)
Visualize the distribution of the data per feature after normalization.
# Define number of bins for histogram
nbins = 16
for i, array in enumerate(resulting_arrays):
    # Number of selected features
    nfeatures = len(case_info[i]['features'])
    print(f"Number of selected features for case {i}: {nfeatures}")
    # Plot histograms for each of the selected features
    fig, axs = plt.subplots(1, nfeatures, figsize=(nfeatures * 4, 6))
    for feature in range(nfeatures):
        axs[feature].hist(array[:, feature], nbins)
        axs[feature].set_title(case_info[i]['features'][feature])  # Use the correct feature name
    plt.show()
Number of selected features for case 0: 4
Number of selected features for case 1: 2
Number of selected features for case 2: 2
🪓 Splitting into train/test¶
Note that we do not split the data into a train and a test set here: clustering is unsupervised learning, so there are no held-out labels to evaluate predictions against.
Modelling¶
To perform hierarchical clustering, we use sklearn.cluster.AgglomerativeClustering.
See https://scikit-learn.org/stable/modules/generated/sklearn.cluster.AgglomerativeClustering.html
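A minimal sketch of the two ways to control the number of clusters (toy data, not the Iris set): pass n_clusters directly, or pass a distance_threshold and set n_clusters=None so the threshold decides.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Two well-separated groups on a line
X = np.array([[0.0], [0.1], [0.2], [10.0], [10.1], [10.2]])

# Option 1: fix the number of clusters directly
fixed = AgglomerativeClustering(n_clusters=2).fit(X)

# Option 2: cut the tree at a distance threshold; n_clusters must then be None
thresh = AgglomerativeClustering(distance_threshold=5.0, n_clusters=None).fit(X)

print(fixed.labels_)
print(thresh.n_clusters_)  # 2: merging the two groups would exceed the threshold
```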
from sklearn.cluster import AgglomerativeClustering
from sklearn.cluster import KMeans
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler
agglom = []
for i, array in enumerate(resulting_arrays):
    # Define number of clusters by setting distance threshold
    agglom_append = AgglomerativeClustering(distance_threshold=10, n_clusters=None)
    # Note: unlike KMeans and DBSCAN below, the agglomerative models are fitted on unscaled data
    agglom_append.fit(array)
    agglom.append(agglom_append)

kmean_array = []
for i, array in enumerate(resulting_arrays):
    scaler = StandardScaler()
    scaled_data = scaler.fit_transform(array)
    kmeans = KMeans(n_clusters=3)
    kmeans.fit(scaled_data)
    kmean_array.append(kmeans)

DBSCAN_array = []
for i, array in enumerate(resulting_arrays):
    scaler = StandardScaler()
    scaled_data = scaler.fit_transform(array)
    dbscan = DBSCAN(eps=0.45)
    dbscan.fit(scaled_data)
    DBSCAN_array.append(dbscan)
Print the number of clusters found by each algorithm. For AgglomerativeClustering, the number of clusters is determined by the distance_threshold; if distance_threshold=None, n_clusters_ equals the given n_clusters. For KMeans, n_clusters is simply the value we specified.
for agg in agglom:
    print('Number of clusters: ', agg.n_clusters_)

for kmean in kmean_array:
    print('Number of clusters: ', kmean.n_clusters)
Number of clusters:  3
Number of clusters:  2
Number of clusters:  3
Number of clusters:  3
Number of clusters:  3
Number of clusters:  3
Cluster labels are stored in an ndarray of shape (n_samples,). For DBSCAN, the label -1 marks noise points.
for agg in agglom:
    print(np.unique(agg.labels_))

for kmean in kmean_array:
    print(np.unique(kmean.labels_))

for DBscan in DBSCAN_array:
    print(np.unique(DBscan.labels_))
[0 1 2]
[0 1]
[0 1 2]
[0 1 2]
[0 1 2]
[0 1 2]
[-1 0 1 2]
[-1 0 1]
[0 1]
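To judge how well the discovered clusters match the known species, a contingency table of cluster labels against species is useful. A self-contained sketch using scikit-learn's bundled copy of the data:

```python
import pandas as pd
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import load_iris

iris = load_iris()
labels = AgglomerativeClustering(n_clusters=3).fit_predict(iris.data)

# Rows: discovered clusters; columns: true species
ct = pd.crosstab(pd.Series(labels, name="Cluster"),
                 pd.Series(iris.target_names[iris.target], name="Species"))
print(ct)
```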
Print the number of leaves in the hierarchical tree for each agglomerative model, and the number of input features seen by the KMeans and DBSCAN models.
for agg in agglom:
    print(agg.n_leaves_)

for kmean in kmean_array:
    print(kmean.n_features_in_)

for DBscan in DBSCAN_array:
    print(DBscan.n_features_in_)
150
150
150
4
2
2
4
2
2
Print the merge distances of the agglomerative models, the inertia of the K-means models, and the core samples of the DBSCAN models.
for agg in agglom:
    print(agg.distances_)

for kmean in kmean_array:
    print(kmean.inertia_)

for DBscan in DBSCAN_array:
    print(DBscan.components_)
[ 0.          0.          0.          0.1        ...  6.39940682 12.30039605 32.42801258]
[ 0.          0.          ...  4.68991092  6.09685802 11.68678645]
[ 0.          0.          ...  4.67443318 10.52767915 30.43899692]
141.15417813388655
103.81453420646659
18.046983891906272
(DBSCAN core-sample arrays omitted)
for agg in agglom:
    plt.hist(agg.distances_, 30)
    plt.show()

# inertia_ is a single scalar, so a histogram is not meaningful; print it instead
for kmean in kmean_array:
    print('Inertia: ', kmean.inertia_)
Plot hierarchical clustering dendrogram.
The code below is adapted from https://scikit-learn.org/stable/auto_examples/cluster/plot_agglomerative_dendrogram.html#sphx-glr-auto-examples-cluster-plot-agglomerative-dendrogram-py
from scipy.cluster.hierarchy import dendrogram
def plot_dendrogram(model, **kwargs):
    # Create linkage matrix and then plot the dendrogram

    # create the counts of samples under each node
    counts = np.zeros(model.children_.shape[0])
    n_samples = len(model.labels_)
    for i, merge in enumerate(model.children_):
        current_count = 0
        for child_idx in merge:
            if child_idx < n_samples:
                current_count += 1  # leaf node
            else:
                current_count += counts[child_idx - n_samples]
        counts[i] = current_count

    linkage_matrix = np.column_stack([model.children_, model.distances_,
                                      counts]).astype(float)

    # Plot the corresponding dendrogram
    dendrogram(linkage_matrix, **kwargs)
for agg in agglom:
    plt.title('Hierarchical Clustering Dendrogram')
    # plot the top three levels of the dendrogram
    plot_dendrogram(agg, truncate_mode='level', p=3)
    plt.xlabel("Number of points in node (or index of point if no parenthesis).")
    plt.show()
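The same linkage matrix the helper builds can also be computed directly with SciPy, and fcluster then cuts the tree at any level without refitting. A sketch (ward linkage, matching AgglomerativeClustering's default):

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from sklearn.datasets import load_iris

X = load_iris().data
# One row per merge: left child, right child, merge distance, cluster size
Z = linkage(X, method="ward")
print(Z.shape)  # (149, 4): 150 samples yield 149 merges

# Cut the tree so that (at most) 3 flat clusters remain
labels = fcluster(Z, t=3, criterion="maxclust")
print(np.unique(labels))
```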
Inference¶
For each datapoint, add the cluster label to the original Iris dataset.
for i, agg in enumerate(agglom):
    df_iris[f'Cluster {i}'] = agg.labels_.astype(str)
    df_iris[f'Cluster {i}'] = 'Cluster ' + df_iris[f'Cluster {i}']

# Continue numbering from the last agglomerative cluster column
for kmean in kmean_array:
    i += 1
    df_iris[f'Cluster {i}'] = kmean.labels_.astype(str)
    df_iris[f'Cluster {i}'] = 'Cluster ' + df_iris[f'Cluster {i}']

for DBscan in DBSCAN_array:
    i += 1
    df_iris[f'Cluster {i}'] = DBscan.labels_.astype(str)
    df_iris[f'Cluster {i}'] = 'Cluster ' + df_iris[f'Cluster {i}']

display(df_iris.head(5))
clusterattampts = i
| | Sepal Length | Sepal Width | Petal Length | Petal Width | Species | Cluster 0 | Cluster 1 | Cluster 2 | Cluster 3 | Cluster 4 | Cluster 5 | Cluster 6 | Cluster 7 | Cluster 8 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | Iris-setosa | Cluster 1 | Cluster 1 | Cluster 1 | Cluster 0 | Cluster 0 | Cluster 2 | Cluster 0 | Cluster 0 | Cluster 0 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | Iris-setosa | Cluster 1 | Cluster 1 | Cluster 1 | Cluster 0 | Cluster 0 | Cluster 2 | Cluster 0 | Cluster 0 | Cluster 0 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | Iris-setosa | Cluster 1 | Cluster 1 | Cluster 1 | Cluster 0 | Cluster 0 | Cluster 2 | Cluster 0 | Cluster 0 | Cluster 0 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | Iris-setosa | Cluster 1 | Cluster 1 | Cluster 1 | Cluster 0 | Cluster 0 | Cluster 2 | Cluster 0 | Cluster 0 | Cluster 0 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | Iris-setosa | Cluster 1 | Cluster 1 | Cluster 1 | Cluster 0 | Cluster 0 | Cluster 2 | Cluster 0 | Cluster 0 | Cluster 0 |
display(df_iris.tail(5))
| | Sepal Length | Sepal Width | Petal Length | Petal Width | Species | Cluster 0 | Cluster 1 | Cluster 2 | Cluster 3 | Cluster 4 | Cluster 5 | Cluster 6 | Cluster 7 | Cluster 8 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 145 | 6.7 | 3.0 | 5.2 | 2.3 | Iris-virginica | Cluster 2 | Cluster 0 | Cluster 0 | Cluster 2 | Cluster 2 | Cluster 1 | Cluster 2 | Cluster 1 | Cluster 1 |
| 146 | 6.3 | 2.5 | 5.0 | 1.9 | Iris-virginica | Cluster 0 | Cluster 0 | Cluster 0 | Cluster 1 | Cluster 1 | Cluster 1 | Cluster -1 | Cluster 1 | Cluster 1 |
| 147 | 6.5 | 3.0 | 5.2 | 2.0 | Iris-virginica | Cluster 2 | Cluster 0 | Cluster 0 | Cluster 2 | Cluster 2 | Cluster 1 | Cluster 2 | Cluster 1 | Cluster 1 |
| 148 | 6.2 | 3.4 | 5.4 | 2.3 | Iris-virginica | Cluster 2 | Cluster 0 | Cluster 0 | Cluster 2 | Cluster 2 | Cluster 1 | Cluster -1 | Cluster 1 | Cluster 1 |
| 149 | 5.9 | 3.0 | 5.1 | 1.8 | Iris-virginica | Cluster 0 | Cluster 0 | Cluster 0 | Cluster 1 | Cluster 1 | Cluster 1 | Cluster -1 | Cluster 1 | Cluster 1 |
Plot pairwise relationships per species and per cluster using seaborn.pairplot.
# pairplot creates its own figure, so no plt.figure call is needed
sns.pairplot(df_iris[["Sepal Length", "Sepal Width", "Petal Length", "Petal Width", "Species"]], hue="Species")
plt.show()
for i in range(0, clusterattampts + 1):
    sns.pairplot(df_iris[["Sepal Length", "Sepal Width", "Petal Length", "Petal Width", f'Cluster {i}']], hue=f'Cluster {i}')
    plt.show()
Evaluation¶
If clustering is successful, one may expect that flowers of the same species end up in the same cluster. Let us check whether this is the case.
Code for the bar plot is adapted from https://matplotlib.org/stable/gallery/lines_bars_and_markers/barchart.html#sphx-glr-gallery-lines-bars-and-markers-barchart-py
# Define labels for species
species = df_iris['Species'].unique()
for i in range(0, clusterattampts + 1):
# Define labels for clusters
clusters = df_iris[F'Cluster {i}'].unique()
# Sort cluster names in alphabetical order, i.e.,
# Cluster 0, Cluster 1, Cluster 2, etc.
clusters.sort()
# Determine the location for cluster labels
x = np.arange(len(clusters))
# Define the width of the bars
width = 0.25
# Create the bar plot
fig, ax = plt.subplots()
offset = -width
for spec in species:
nr_occurrences = []
for clus in clusters:
nr = df_iris[(df_iris['Species']==spec) & (df_iris[F'Cluster {i}']==clus)][F'Cluster {i}'].count()
nr_occurrences.append(nr)
rects = ax.bar(x + offset, nr_occurrences, width, label=spec)
offset = offset + width
# Add text for labels, title and custom x-axis tick labels, etc.
ax.set_ylabel('Number of occurrences')
ax.set_title(f'Number of occurrences in cluster using method: {i}')
ax.set_xticks(x)
ax.set_xticklabels(clusters)
ax.legend()
fig.tight_layout()
plt.show()
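The species-versus-cluster counts shown in the bars can also be tabulated directly with `pd.crosstab`. A small sketch using a fresh 3-cluster agglomerative fit (not the notebook's stored `agglom` models):

```python
import pandas as pd
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import load_iris

iris = load_iris()
labels = AgglomerativeClustering(n_clusters=3).fit_predict(iris.data)

# Contingency table: rows are the true species, columns the assigned cluster
species = pd.Series(iris.target, name='Species').map(dict(enumerate(iris.target_names)))
table = pd.crosstab(species, pd.Series(labels, name='Cluster'))
print(table)
```

A near-diagonal table (after permuting the arbitrary cluster numbers) indicates that the clusters recover the species well.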
After clustering with the three agglomerative feature combinations (methods 0, 1, 2), method 2, which uses 'Petal Length' and 'Petal Width', appears to perform best.
With k-means (methods 3, 4, 5), however, the choice between 'Petal Length'/'Petal Width' and 'Sepal Length'/'Sepal Width' makes little difference: the results are close enough that the two feature pairs can be said to perform about the same.
print(df_iris['Species'].value_counts())
for i in range(0, clusterattampts + 1):
print(F'\nusing method: {i}:')
print(df_iris[F'Cluster {i}'].value_counts())
species = df_iris['Species'].unique()
for spec in species:
print('Number of samples per cluster for',spec)
print(df_iris[df_iris['Species']==spec][F'Cluster {i}'].value_counts())
Species
Iris-setosa        50
Iris-versicolor    50
Iris-virginica     50
Name: count, dtype: int64

using method: 0:
Cluster 0
Cluster 0    64
Cluster 1    50
Cluster 2    36
Name: count, dtype: int64
Number of samples per cluster for Iris-setosa
Cluster 0
Cluster 1    50
Name: count, dtype: int64
Number of samples per cluster for Iris-versicolor
Cluster 0
Cluster 0    49
Cluster 2     1
Name: count, dtype: int64
Number of samples per cluster for Iris-virginica
Cluster 0
Cluster 2    35
Cluster 0    15
Name: count, dtype: int64

using method: 1:
Cluster 1
Cluster 0    94
Cluster 1    56
Name: count, dtype: int64
Number of samples per cluster for Iris-setosa
Cluster 1
Cluster 1    50
Name: count, dtype: int64
Number of samples per cluster for Iris-versicolor
Cluster 1
Cluster 0    45
Cluster 1     5
Name: count, dtype: int64
Number of samples per cluster for Iris-virginica
Cluster 1
Cluster 0    49
Cluster 1     1
Name: count, dtype: int64

using method: 2:
Cluster 2
Cluster 0    54
Cluster 1    50
Cluster 2    46
Name: count, dtype: int64
Number of samples per cluster for Iris-setosa
Cluster 2
Cluster 1    50
Name: count, dtype: int64
Number of samples per cluster for Iris-versicolor
Cluster 2
Cluster 2    45
Cluster 0     5
Name: count, dtype: int64
Number of samples per cluster for Iris-virginica
Cluster 2
Cluster 0    49
Cluster 2     1
Name: count, dtype: int64

using method: 3:
Cluster 3
Cluster 1    56
Cluster 0    50
Cluster 2    44
Name: count, dtype: int64
Number of samples per cluster for Iris-setosa
Cluster 3
Cluster 0    50
Name: count, dtype: int64
Number of samples per cluster for Iris-versicolor
Cluster 3
Cluster 1    39
Cluster 2    11
Name: count, dtype: int64
Number of samples per cluster for Iris-virginica
Cluster 3
Cluster 2    33
Cluster 1    17
Name: count, dtype: int64

using method: 4:
Cluster 4
Cluster 1    57
Cluster 0    51
Cluster 2    42
Name: count, dtype: int64
Number of samples per cluster for Iris-setosa
Cluster 4
Cluster 0    49
Cluster 1     1
Name: count, dtype: int64
Number of samples per cluster for Iris-versicolor
Cluster 4
Cluster 1    36
Cluster 2    12
Cluster 0     2
Name: count, dtype: int64
Number of samples per cluster for Iris-virginica
Cluster 4
Cluster 2    30
Cluster 1    20
Name: count, dtype: int64

using method: 5:
Cluster 5
Cluster 0    52
Cluster 2    50
Cluster 1    48
Name: count, dtype: int64
Number of samples per cluster for Iris-setosa
Cluster 5
Cluster 2    50
Name: count, dtype: int64
Number of samples per cluster for Iris-versicolor
Cluster 5
Cluster 0    48
Cluster 1     2
Name: count, dtype: int64
Number of samples per cluster for Iris-virginica
Cluster 5
Cluster 1    46
Cluster 0     4
Name: count, dtype: int64

using method: 6:
Cluster 6
Cluster -1    57
Cluster 0     40
Cluster 1     38
Cluster 2     15
Name: count, dtype: int64
Number of samples per cluster for Iris-setosa
Cluster 6
Cluster 0     40
Cluster -1    10
Name: count, dtype: int64
Number of samples per cluster for Iris-versicolor
Cluster 6
Cluster 1     37
Cluster -1    13
Name: count, dtype: int64
Number of samples per cluster for Iris-virginica
Cluster 6
Cluster -1    34
Cluster 2     15
Cluster 1      1
Name: count, dtype: int64

using method: 7:
Cluster 7
Cluster 1     83
Cluster 0     43
Cluster -1    24
Name: count, dtype: int64
Number of samples per cluster for Iris-setosa
Cluster 7
Cluster 0     43
Cluster -1     7
Name: count, dtype: int64
Number of samples per cluster for Iris-versicolor
Cluster 7
Cluster 1     42
Cluster -1     8
Name: count, dtype: int64
Number of samples per cluster for Iris-virginica
Cluster 7
Cluster 1     41
Cluster -1     9
Name: count, dtype: int64

using method: 8:
Cluster 8
Cluster 1    100
Cluster 0     50
Name: count, dtype: int64
Number of samples per cluster for Iris-setosa
Cluster 8
Cluster 0    50
Name: count, dtype: int64
Number of samples per cluster for Iris-versicolor
Cluster 8
Cluster 1    50
Name: count, dtype: int64
Number of samples per cluster for Iris-virginica
Cluster 8
Cluster 1    50
Name: count, dtype: int64
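Comparing per-cluster counts by hand, as above, can be condensed into a single number: the adjusted Rand index compares a clustering against the true labels (1.0 means perfect agreement, values near 0.0 mean no better than chance). A sketch with `sklearn.metrics.adjusted_rand_score`, run on fresh fits rather than the notebook's stored models:

```python
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.datasets import load_iris
from sklearn.metrics import adjusted_rand_score

iris = load_iris()

# Score each clustering against the known species labels
for name, model in [('agglomerative', AgglomerativeClustering(n_clusters=3)),
                    ('k-means', KMeans(n_clusters=3, n_init=10, random_state=0))]:
    labels = model.fit_predict(iris.data)
    ari = adjusted_rand_score(iris.target, labels)
    print(f'{name}: ARI = {ari:.3f}')
```

Because the index is adjusted for chance and invariant to how the clusters are numbered, it sidesteps the label-matching bookkeeping needed for the bar plots.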